1. Principles of Data Visualisation

Before we create visualisations, it’s important to understand the principles of data visualisation:

  • Clarity: Visualisations should be clear and easy to interpret.
  • Relevance: Only visualise data that is relevant to the message or story you’re trying to tell.
  • Choosing the Right Chart: Ensure you select an appropriate chart type for the data and insight you want to convey.

Common visualisation types include:

  • Scatter plots: For visualising the relationship between two continuous variables.
  • Bar plots: For comparing categorical data.
  • Histograms: For displaying the distribution of a numeric variable.
  • Line plots: For trends over time or continuous data.

Misleading and Bad Visualisations

Misleading visualisations can distort the interpretation of data by manipulating elements like scales, axes, or colors. One common tactic is truncating the y-axis, which exaggerates small differences and makes them appear more significant than they are. Another technique is inconsistent scaling, where proportions are not accurately represented, leading to misinformed conclusions. It is essential to use clear and honest visual representations to ensure that the data is communicated accurately without introducing bias or confusion.

Examples

Let’s say we want to report on whether the number of crimes has increased over two selected years.


From the above plot, it seems that the crime rate has jumped significantly from 2010 to 2011. However, can you notice anything suspicious about the plot?

Now, if we set the scale to start at zero we get:

Plot 1

Now we can see that in reality, the crime rate has only increased marginally. This is a common tactic used in politics and the news when reporting. For example:

Tax rate as reported on Fox News. Left bar is 35%. Right bar is 39.6%
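
The effect of a truncated axis can be reproduced with a small sketch. The crime counts below are invented purely for illustration (they are not the figures behind the plots in this section), and the object names are our own choice. The same bars are drawn twice, once with the y-axis zoomed in and once starting at zero.

```r
library(ggplot2)

# Hypothetical crime counts, invented for illustration only
crimes <- data.frame(
  year  = factor(c(2010, 2011)),
  count = c(9800, 10050)
)

base_plot <- ggplot(crimes, aes(x = year, y = count)) +
  geom_col(fill = "steelblue") +
  labs(x = "Year", y = "Number of crimes")

# Truncated y-axis: zooming in makes the small difference look dramatic
truncated <- base_plot +
  coord_cartesian(ylim = c(9700, 10100)) +
  labs(title = "Truncated y-axis")

# Zero-based y-axis: bar heights are proportional to the counts
zero_based <- base_plot +
  labs(title = "Zero-based y-axis")

# Relative change between the two years, in percent
pct_change <- 100 * (crimes$count[2] - crimes$count[1]) / crimes$count[1]
```

Printing truncated and zero_based one after the other shows how the same 2.5% increase can look either dramatic or marginal depending on where the axis starts.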

Not All Bad

However, a truncated scale is not always a bad thing. In the example below, we show the IQ for three individuals. Setting the scale at zero, it appears that all three people have a very similar IQ:

However, since IQ is sensitive to small changes, it makes more sense to view the differences on a smaller scale to get a true picture of the differences between the people:

Example 2

A famous example of a misleading plot was published by Reuters news agency on a report about gun safety in Florida. What can you notice that is unusual about this plot?

Gun deaths in Florida
ANSWER: At first glance, the graph suggests a sharp drop in gun deaths after Florida enacted its ‘stand your ground’ law in 2005, because we’re used to seeing the y-axis start at zero. However, this plot is upside-down and the y-axis is reversed. In reality, gun deaths increased from around 550 to 850 between 2005 and 2007.
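
A reversed axis is easy to reproduce. The sketch below uses only the two approximate figures quoted in the answer (about 550 in 2005 and about 850 in 2007); it is not the data behind the original chart, and the object names are our own.

```r
library(ggplot2)

# Approximate figures quoted in the answer; intermediate years are omitted
gun_deaths <- data.frame(
  year   = c(2005, 2007),
  deaths = c(550, 850)
)

base_plot <- ggplot(gun_deaths, aes(x = year, y = deaths)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = c(2005, 2007)) +
  labs(x = "Year", y = "Gun deaths")

# Conventional orientation: larger values are higher up
conventional <- base_plot + labs(title = "Conventional y-axis")

# Reversed orientation: larger values are lower down, so a rise looks like a fall
reversed <- base_plot + scale_y_reverse() + labs(title = "Reversed y-axis")
```

In the reversed version the line slopes downwards even though the values increase, which is the visual trap described above.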

See more: https://www.nytimes.com/column/whats-going-on-in-this-graph

2. Starting the Session: Clear the Environment and Load a Dataset

Before we begin creating visualisations, let’s start by clearing the R environment and loading a dataset that contains a variety of column types.

Clearing the Environment

To ensure we start fresh, let’s clear the environment by removing all existing objects.

# Clear the environment
rm(list = ls())

Loading the diamonds Dataset

We will use the built-in diamonds dataset from the ggplot2 package, which contains 53,940 rows and 10 columns. This dataset includes both numeric and factor columns, making it suitable for various types of visualisation.

First, let’s load the ggplot2 package and then the diamonds dataset.

# Load the ggplot2 package
library(ggplot2)

# Load the diamonds dataset
data(diamonds)

# View the first few rows of the dataset
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
# Check the structure of the dataset to see the different column types
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Explanation:
- data(diamonds): Loads the diamonds dataset, which contains 53,940 observations and 10 variables related to diamonds, including price, carat, cut, color, and clarity.
- head(diamonds): Displays the first 6 rows of the dataset.
- str(diamonds): Shows the structure of the dataset, with several factors (e.g., cut, color, clarity) and numeric columns (e.g., price, carat).

As we can see, some of the variables (cut, color, and clarity) are ordered factors. An ordered factor is a special type of factor where the levels have a meaningful order or hierarchy. This is used when the categories can be ranked or ordered in a logical sequence, such as “low”, “medium”, and “high”.

Key Differences Between Regular Factor and Ordered Factor

1.  Order of Levels:
    •   Regular Factor: The levels are unordered, meaning that R does not assume any relationship between the categories.
        Example: Colors (red, blue, green) don’t have a natural order, so they would be a regular factor.
    •   Ordered Factor: The levels have a specific order, and R will treat them as ranked.
        Example: Education levels (high school, bachelor's, master's, PhD) have a natural order, so they would be an ordered factor.
2.  Comparison:
    •   Regular Factor: Categories cannot be meaningfully compared using the greater than (>) or less than (<) operators. Trying to do so returns NA with a warning.
    •   Ordered Factor: Since the levels have a meaningful order, they can be compared using the greater than (>) or less than (<) operators.
        For example, bachelor's > high school would return TRUE for an ordered factor.
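
The difference can be checked directly in the console. The following is a minimal sketch using the colour and education examples from the list; the vectors and variable names are made up for illustration.

```r
# Regular (unordered) factor: R assumes no ranking between levels
colours <- factor(c("red", "blue", "green"))
colours_ordered <- is.ordered(colours)

# Comparing unordered factor values is not meaningful;
# R returns NA and issues a warning (suppressed here)
unordered_cmp <- suppressWarnings(colours[1] > colours[2])

# Ordered factor: the levels argument defines the ranking, lowest first
education <- factor(
  c("bachelor's", "high school", "PhD"),
  levels  = c("high school", "bachelor's", "master's", "PhD"),
  ordered = TRUE
)
education_ordered <- is.ordered(education)

# bachelor's ranks above high school, so this comparison is TRUE
ordered_cmp <- education[1] > education[2]

# Functions that rely on ranking, such as max(), also work on ordered factors
highest <- as.character(max(education))
```

The same checks apply to the diamonds data: is.ordered(diamonds$cut) reports whether cut is stored as an ordered factor.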

3. Basic Plotting in Base R

Let’s begin by creating some basic plots using base R to understand fundamental visualisation techniques. We’ll use the diamonds dataset.

3.1 Scatter Plot

A scatter plot shows the relationship between two continuous variables. Let’s create a scatter plot between carat (diamond size) and price.

# Scatter plot of carat vs price
plot(diamonds$carat, diamonds$price, 
     main = "Scatter Plot of Carat vs Price", 
     xlab = "Carat", 
     ylab = "Price", 
     col = "blue", 
     pch = 19)

Explanation:
The plot() function is used to create a scatter plot, with carat on the x-axis and price on the y-axis. The points are colored blue (col = "blue") and use solid circles (pch = 19).


3.2 Bar Plot

A bar plot visualises the frequency of categories in a factor variable. Let’s create a bar plot for the cut variable, which represents the quality of the diamond’s cut.

# Bar plot of the frequency of diamond cut
barplot(table(diamonds$cut), 
        main = "Bar Plot of Diamond Cut", 
        xlab = "Cut", 
        ylab = "Frequency", 
        col = "lightblue")

Explanation:
We use table() to count the frequency of each level in the cut variable and barplot() to visualise the distribution of cut quality. The bars are colored light blue (col = "lightblue").


3.3 Histogram

A histogram helps visualise the distribution of a continuous variable. Let’s create a histogram for the price of diamonds.

# Histogram of diamond prices
hist(diamonds$price, 
     main = "Histogram of Diamond Prices", 
     xlab = "Price", 
     col = "orange", 
     border = "black")

Explanation:
The hist() function creates a histogram of price to show its distribution. The bars are colored orange (col = "orange"), and border = "black" adds black borders around the bars.


3.4 Line Plot

Though the diamonds dataset doesn’t include a time series, a line plot can still be used to show trends. Here, we’ll plot the average price of diamonds by carat size.

# Line plot of average price by carat
avg_price <- aggregate(price ~ carat, data = diamonds, FUN = mean)
plot(avg_price$carat, avg_price$price, 
     type = "l", 
     main = "Line Plot of Average Price by Carat", 
     xlab = "Carat", 
     ylab = "Average Price", 
     col = "blue", 
     lwd = 2)

Explanation:
We use aggregate() to calculate the mean price for each carat size and plot it using plot() with type = "l" to create a line plot. The line is blue (col = "blue") and its width is increased with lwd = 2.


4. Introduction to ggplot2

The ggplot2 package is a powerful and flexible visualisation tool in R. It follows the grammar of graphics, where plots are built layer by layer, allowing for complex and highly customised visualisations.

4.1 Structure of a ggplot

A basic ggplot consists of the following components:
- Data: The dataset to visualise.
- Aesthetic Mappings: Map data to visual properties (e.g., x, y, color).
- Geometries (Geoms): The type of plot to create (e.g., points, bars, lines).
- Layers: Additional layers such as titles, labels, themes.
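
To make the components concrete, the sketch below builds one plot step by step, adding a single component at a time with +. The object name p and the choice of geom and theme are illustrative, not prescribed by ggplot2.

```r
library(ggplot2)

# Data and aesthetic mappings: nothing is drawn yet, only the axes are set up
p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price))

# Geometry: draw one point per diamond
p <- p + geom_point(alpha = 0.3)

# Additional layers: labels and a theme
p <- p +
  labs(title = "Price vs Carat", x = "Carat", y = "Price") +
  theme_minimal()
```

Printing p after any of these steps renders the plot as built so far, which is a convenient way to see what each component contributes.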


4.2 Scatter Plot with ggplot2

Let’s create a scatter plot of carat vs price, similar to the base R example, but using ggplot2.

# Scatter plot of carat vs price using ggplot2
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Scatter Plot of Carat vs Price", x = "Carat", y = "Price")

Explanation:
In this example, we use ggplot() to define the data and aesthetics (aes()). The geom_point() function adds points to create a scatter plot, and labs() is used to add a title and axis labels.


4.3 Bar Plot with ggplot2

Let’s recreate the bar plot of the cut variable using ggplot2.

# Bar plot of diamond cut using ggplot2
ggplot(data = diamonds, aes(x = cut)) +
  geom_bar(fill = "lightblue") +
  labs(title = "Bar Plot of Diamond Cut", x = "Cut", y = "Frequency")

Explanation:
We use geom_bar() to create a bar plot of the cut variable. The fill argument sets the color of the bars, and labs() is used to add the title and axis labels.


4.4 Histogram with ggplot2

Let’s create a histogram of price using ggplot2.

# Histogram of diamond prices using ggplot2
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000, fill = "orange", color = "black") +
  labs(title = "Histogram of Diamond Prices", x = "Price", y = "Count")

Explanation:
We use geom_histogram() to create a histogram of the price variable. The binwidth parameter controls the width of the bars (here, each bar covers a price range of 1000), fill sets the interior colour of the bars, and color sets the colour of their outlines.


4.5 Customising Plots in ggplot2

Customising plots in ggplot2 allows you to tailor the appearance of your visualisations. You can adjust elements such as titles, axis labels, themes, scales, and even create subplots (faceting).

4.5.1 Adding Titles, Axis Labels, and Captions

You can use labs() to add or modify titles, axis labels, and captions for your plots.

# Customising titles and axis labels
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(
    title = "Scatter Plot of Carat vs Price",
    subtitle = "Data from the diamonds dataset",
    x = "Carat (size of diamond)", 
    y = "Price (in US dollars)",
    caption = "Source: ggplot2 diamonds dataset"
  )

Explanation:
- title and subtitle: Add a main title and a subtitle to the plot.
- x and y: Customise the axis labels.
- caption: Add a caption at the bottom of the plot.


4.5.2 Adjusting Themes

ggplot2 provides several pre-built themes that can change the overall look of your plots. You can also customise these themes further.

# Applying different themes
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Scatter Plot with Custom Theme") +
  theme_bw() # Apply the bw theme

Common themes in ggplot2 include:
- theme_minimal(): A clean, minimalistic theme.
- theme_classic(): A theme with classic x and y axis lines.
- theme_bw(): A black and white theme.

Explanation:
Themes allow you to quickly change the appearance of your plot without manually customising each element. You can also modify individual elements of a theme using theme().
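
As a sketch of that last point, the example below starts from a pre-built theme and then overrides a few individual elements with theme(). The particular elements changed here are arbitrary choices for illustration.

```r
library(ggplot2)

custom_theme_plot <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue", alpha = 0.3) +
  labs(title = "Scatter Plot with a Modified Theme") +
  theme_minimal() +  # Start from a pre-built theme
  theme(
    plot.title       = element_text(face = "bold", hjust = 0.5),  # Bold, centred title
    panel.grid.minor = element_blank(),                           # Remove minor grid lines
    axis.text.x      = element_text(angle = 45, hjust = 1)        # Rotate x-axis labels
  )
```

The call to theme() should come after the pre-built theme, because a complete theme such as theme_minimal() replaces any element settings made before it.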


4.5.3 Modifying Scales

You can adjust scales for axes, colours, and sizes to better represent your data. This includes modifying axis limits, colour schemes, and scale transformations (e.g., log scale).

# Scatter plot with customised scales
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +  # Colour points by cut
  labs(title = "Price vs Carat Coloured by Cut") +
  scale_x_continuous(limits = c(0, 3)) +  # Set limits for the x-axis
  scale_color_brewer(palette = "Dark2") +  # Apply a custom colour palette
  theme_bw() 
## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).

Explanation:
- scale_x_continuous(): Sets the limits for the x-axis (in this case, carat is limited to between 0 and 3). Points outside these limits are dropped from the plot, which is why R warns that 32 rows were removed.
- scale_color_brewer(): Changes the colour scheme using a pre-defined palette from the RColorBrewer package.
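
Setting limits on a scale removes the data outside them, whereas coord_cartesian() only zooms the view. The sketch below contrasts the two and also shows a log-transformed y-axis, one of the scale transformations mentioned earlier. The object names are our own.

```r
library(ggplot2)

base_plot <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3)

# Number of diamonds that fall outside the 0 to 3 carat range
n_outside <- sum(diamonds$carat > 3)

# Scale limits: points outside the range are dropped (with a warning when drawn)
dropped <- base_plot + scale_x_continuous(limits = c(0, 3))

# Coordinate limits: all data are kept, the view is simply zoomed in
zoomed <- base_plot + coord_cartesian(xlim = c(0, 3))

# Scale transformation: show price on a log10 axis
logged <- base_plot + scale_y_log10()
```

The distinction matters most when a plot includes summaries such as geom_smooth(), because dropped points no longer contribute to the fitted line while zoomed-out points still do.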


4.5.4 Faceting for Subplots

Faceting allows you to create multiple plots based on a categorical variable, essentially creating subplots that share the same axes and scales.

# Facet scatter plot by cut
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut)) +
  labs(title = "Faceted Plot of Price vs Carat by Cut") +
  facet_wrap(~cut) + # Create a facet for each level of cut
  theme_bw() 

Explanation:
- facet_wrap(~cut): Creates a separate scatter plot for each level of the cut variable, arranging them in a grid format.

Faceting is useful for comparing how a relationship changes across different categories (e.g., different cuts of diamonds).


5. Advanced Plot Features

5.1 Adding Annotations

Annotations allow you to add text, arrows, or shapes to highlight specific points or areas of interest in your plots. You can use annotate() to add annotations to your ggplot2 visualisations.

# Scatter plot with annotation
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Annotated Scatter Plot of Carat vs Price") +
  annotate("text", x = 1, y = 15000, label = "High Value", color = "black", size = 10) +  # Add a text annotation
  annotate("rect", xmin = 1, xmax = 1.5, ymin = 5000, ymax = 10000, alpha = 0.5, fill = "yellow") + # Highlight an area
  theme_bw()

Explanation:
- annotate("text"): Adds text at the specified x and y coordinates.
- annotate("rect"): Draws a shaded rectangle to highlight a region of the plot, with adjustable transparency using alpha.
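
Arrows can be added in the same way with annotate("segment"). The sketch below points an arrow at the largest diamond in the dataset; the label text and the offsets used to position it are arbitrary choices for illustration.

```r
library(ggplot2)

# Locate the diamond with the largest carat value
largest <- diamonds[which.max(diamonds$carat), ]

arrow_plot <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue", alpha = 0.3) +
  # Arrow from the label position to the point of interest
  annotate("segment",
           x = largest$carat - 1, xend = largest$carat,
           y = largest$price - 6000, yend = largest$price,
           arrow = arrow(length = unit(0.3, "cm")), color = "red") +
  # Text label placed just below the tail of the arrow
  annotate("text",
           x = largest$carat - 1, y = largest$price - 7000,
           label = "Largest diamond", color = "red") +
  labs(title = "Scatter Plot with an Arrow Annotation") +
  theme_bw()
```

Computing the coordinates from the data, rather than typing them in, keeps the annotation attached to the right point if the data change.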


5.2 Adding Error Bars

Error bars are used to represent the uncertainty or variability of the data. In ggplot2, you can add error bars to line or bar plots using geom_errorbar().

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Bar plot with error bars
avg_price <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price), sd_price = sd(price))  # Calculate mean and standard deviation

ggplot(avg_price, aes(x = cut, y = mean_price, fill = cut)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price), width = 0.2) +
  labs(title = "Bar Plot of Mean Price by Cut with Error Bars", x = "Cut", y = "Mean Price") +
  theme_bw()

Explanation:
- geom_errorbar(): Adds vertical error bars showing one standard deviation of price either side of the mean price for each cut.
- ymin and ymax: Set the lower and upper limits of the error bars using the mean price ± standard deviation.
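
The standard deviation describes the spread of individual prices. If the goal is instead to show uncertainty in the mean itself, a common alternative is the standard error, calculated as the standard deviation divided by the square root of the group size. The sketch below swaps in that alternative; the column names se_price and n are our own.

```r
library(ggplot2)
library(dplyr)

# Mean, standard deviation, group size and standard error of price for each cut
se_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    mean_price = mean(price),
    sd_price   = sd(price),
    n          = n(),
    se_price   = sd_price / sqrt(n)
  )

# Bar plot with standard-error bars
se_plot <- ggplot(se_summary, aes(x = cut, y = mean_price, fill = cut)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_price - se_price, ymax = mean_price + se_price),
                width = 0.2) +
  labs(title = "Mean Price by Cut with Standard Error Bars",
       x = "Cut", y = "Mean Price") +
  theme_bw()
```

Because each cut contains many diamonds, the standard-error bars are much shorter than the standard-deviation bars, so it is worth stating in the caption which of the two a plot shows.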


5.3 Removing the Legend

To remove the legend, we can adjust its setting via the theme() function:

library("dplyr")
# Bar plot with error bars and the legend removed
avg_price <- diamonds %>%
  group_by(cut) %>%
  summarise(mean_price = mean(price), sd_price = sd(price))  # Calculate mean and standard deviation

ggplot(avg_price, aes(x = cut, y = mean_price, fill = cut)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price), width = 0.2) +
  labs(title = "Bar Plot of Mean Price by Cut with Error Bars", x = "Cut", y = "Mean Price") +
  theme_bw() +
  theme(legend.position = 'none') # Removes the legend

5.4 Working with Multiple Layers

One of the strengths of ggplot2 is its ability to add multiple layers to a single plot. This allows you to combine different geometries, such as points and lines, on the same plot.

# Scatter plot with a regression line and points
ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue", alpha = 0.5) +  # Add points with transparency
  geom_smooth(method = "lm", color = "red", se = FALSE) +  # Add a regression line
  labs(title = "Scatter Plot of Carat vs Price with Multiple Layers", x = "Carat", y = "Price") +
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

Explanation:
- geom_point(): Plots the scatter points.
- geom_smooth(): Adds a regression line (linear model).
- By layering multiple geoms, you can easily combine different visualisation elements into a single plot.



6. Saving and Exporting Plots

Once you’ve created a plot, you can save it to a file using ggsave().

# Save the last plot as a PNG file
ggsave("scatter_plot.png", width = 7, height = 5)

Explanation:
By default, the ggsave() function saves the last plot displayed. You can specify the filename and the dimensions of the output; the file format is inferred from the filename extension (here, .png).
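
Relying on the last plot displayed can be fragile in longer scripts, so it is often safer to pass the plot object explicitly. The sketch below does this and writes to a temporary directory; the file name, dimensions, and resolution are illustrative choices.

```r
library(ggplot2)

# Store the plot in an object so it can be saved explicitly
p_save <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue", alpha = 0.3) +
  labs(title = "Scatter Plot of Carat vs Price", x = "Carat", y = "Price")

# Write to a temporary directory; replace with your own path as needed
out_file <- file.path(tempdir(), "carat_vs_price.png")

# width and height are in inches by default; dpi sets the resolution
ggsave(filename = out_file, plot = p_save, width = 7, height = 5, dpi = 300)

# Confirm that the file was written
saved_ok <- file.exists(out_file)
```

Changing the extension in the file name (for example to .pdf) is enough to switch the output format.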


7. Interactive Graphics with plotly

Interactive visualisations allow users to explore the data by hovering over points, zooming in, and panning around the plot. The plotly package can easily turn static ggplot2 plots into interactive graphics.

7.1 Installing and Loading plotly

First, you need to install and load the plotly package.

# Install plotly
#install.packages("plotly")

# Load the plotly library
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

7.2 Converting a ggplot2 Plot to an Interactive Plot

To make a ggplot2 plot interactive, simply pass the plot object to the ggplotly() function.

Example: Interactive Scatter Plot

Let’s convert a scatter plot of carat vs price from the diamonds dataset into an interactive plot.

# Create a ggplot2 scatter plot
p <- ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(color = "blue") +
  labs(title = "Interactive Scatter Plot of Carat vs Price", x = "Carat", y = "Price") +
  theme_bw()

# Convert to interactive plot using ggplotly
ggplotly(p)

Explanation:
- ggplotly(p): Converts the ggplot2 plot p into an interactive plot. You can now hover over points to see their values and zoom in/out of the plot.
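
With almost 54,000 points, the interactive version of this plot can be slow to render in a browser. One possible workaround, sketched below, is to plot a random sample and restrict the hover text with the tooltip argument of ggplotly(). The sample size of 2,000 and the seed are arbitrary choices.

```r
library(ggplot2)
library(plotly)

# Draw a reproducible random sample of rows to keep the plot responsive
set.seed(42)
diamonds_sample <- diamonds[sample(nrow(diamonds), 2000), ]

p_small <- ggplot(data = diamonds_sample, aes(x = carat, y = price)) +
  geom_point(color = "blue", alpha = 0.6) +
  labs(title = "Interactive Scatter Plot (random sample)",
       x = "Carat", y = "Price") +
  theme_bw()

# Show only the x and y values when hovering over a point
interactive_plot <- ggplotly(p_small, tooltip = c("x", "y"))
```

Printing interactive_plot in the console or an R Markdown document displays the widget.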


7.3 Interactive Bar Plot

Let’s convert a bar plot of the cut variable from the diamonds dataset into an interactive bar plot.

# Create a ggplot2 bar plot
p_bar <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  labs(title = "Interactive Bar Plot of Diamond Cut", x = "Cut", y = "Count") +
  theme_bw() +
  theme(legend.position = 'none')


# Convert to interactive plot
ggplotly(p_bar)

Explanation:
This converts a bar plot into an interactive format. You can hover over the bars to see the counts of each cut category.


Summary

The plotly package allows you to easily convert static ggplot2 plots into interactive visualisations. By simply using the ggplotly() function, your plots become interactive, enabling zooming, panning, and hovering over points to display additional information.